Table of contents

  • Logs for the methylation data app
    • Introduction
      • 21-11-2024
    • Loading in the data
      • 21-11-2024
    • Data exploration
      • 22-11-2024
      • 25-11-2024
    • BED file annotation
      • 26-11-2024
      • 28-11-2024
      • 03-12-2024
      • 04-12-2024
    • Plotting / visualisation testing and ideas
      • 09-12-2024
      • 10-12-2024

Logs for the methylation data app

Introduction

21-11-2024

This logbook describes the process of developing ideas and visualisations for an application aimed at research students. The application will take DNA methylation data as input. It should make it easier for the students to explore the data they generated and help them understand it.

Loading in the data

21-11-2024

I would like to combine the data from all the files into one single dataframe, with the group id as a column of the df. This way I can compare different conditions to each other.

The first code block loads the libraries that are used.

In [1]:
import os
import seaborn as sns
import matplotlib.pyplot as plt
import polars as pl
import numpy as np
import datashader as ds
import datashader.transfer_functions as tf
from Bio import SeqIO
import hvplot.polars
import pandas as pd
import re
import plotly.express as px
In [2]:
barcodes_names: pl.DataFrame = pl.read_csv("/commons/Themas/Thema06/Methylatie/barcodes.csv")

barcodes_names = barcodes_names.with_columns(controle_n = pl.int_range(pl.len()).over(" description")+1)
barcodes_names = barcodes_names.with_columns(group_and_n = pl.concat_str([pl.col(' description'), pl.col("controle_n")]))

barcodes_names = barcodes_names.with_columns(pl.col(pl.Utf8).str.strip_chars()).drop("controle_n")

print(barcodes_names)
shape: (5, 3)
┌─────────┬─────────────────────┬──────────────────────┐
│ barcode ┆  description        ┆ group_and_n          │
│ ---     ┆ ---                 ┆ ---                  │
│ i64     ┆ str                 ┆ str                  │
╞═════════╪═════════════════════╪══════════════════════╡
│ 11      ┆ Jurkat_DMSO_control ┆ Jurkat_DMSO_control1 │
│ 12      ┆ Jurkat_betuline     ┆ Jurkat_betuline1     │
│ 13      ┆ Healthy_control     ┆ Healthy_control1     │
│ 14      ┆ Jurkat_betuline     ┆ Jurkat_betuline2     │
│ 15      ┆ Jurkat_DMSO_control ┆ Jurkat_DMSO_control2 │
└─────────┴─────────────────────┴──────────────────────┘

This generates a dataframe that contains the barcode and the description of the barcode. The column called group_and_n contains the description followed by a replicate number within that group.

This is needed to label the different groups in the df that will contain all of the data, which is loaded in the code block below.
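
The numbering behind group_and_n can be illustrated without polars. The sketch below is a plain-Python equivalent and not the notebook's code: the helper name number_replicates is made up for illustration, and the input list copies the descriptions from the table above, including the leading space that the CSV contains.

```python
from collections import defaultdict


def number_replicates(descriptions: list[str]) -> list[str]:
    """Append a running replicate number to every description (hypothetical helper)."""
    seen: dict[str, int] = defaultdict(int)
    labels: list[str] = []
    for description in descriptions:
        name = description.strip()  # drop the stray whitespace from the CSV
        seen[name] += 1             # 1 for the first occurrence, 2 for the second, ...
        labels.append(f"{name}{seen[name]}")
    return labels


labels = number_replicates([
    " Jurkat_DMSO_control",
    " Jurkat_betuline",
    " Healthy_control",
    " Jurkat_betuline",
    " Jurkat_DMSO_control",
])
print(labels)
```

The result has the same labels as the group_and_n column, in barcode order.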

In [3]:
path: str = "/commons/Themas/Thema06/Methylatie/analysis"
def load_files(path: str) -> pl.DataFrame:
    # Empty frame with the expected columns; every file is concatenated onto it
    resulting_df: pl.DataFrame = pl.DataFrame(
        {"chr":[],
         "start":[],
         "end":[],
         "frac":[],
         "valid":[],
         "group_name":[]}
    )
    files: list[str] = os.listdir(path)

    for file in files:
        if os.path.isfile(f"{path}/{file}") and file.endswith("methylatie_ALL.csv"):
            temp_df: pl.DataFrame = pl.from_pandas(pd.read_csv(f"{path}/{file}", sep="\t"))
            # The first run of digits in the file name is the barcode number
            barcode_num: list[str] = re.findall(r"\d+", file)

            # .item() extracts the single label from the 1x1 result
            name_group: str = barcodes_names.filter(pl.col("barcode").cast(pl.String) == barcode_num[0]).select("group_and_n").item()
            temp_df = temp_df.with_columns(pl.lit(name_group).alias("group_name"))
            resulting_df = pl.concat([temp_df, resulting_df])

    return resulting_df

df: pl.DataFrame = load_files(path=path)

All of the CSV files are now loaded into one polars dataframe.
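
How a file gets its group label can be shown in isolation. This is a minimal sketch in plain Python, under stated assumptions: the file name below is hypothetical (the real names are only known to end in methylatie_ALL.csv), the dictionary copies the barcode table shown earlier, and label_for_file is a made-up helper name.

```python
import re

# Copy of the barcode table shown earlier, keyed by barcode number as text.
BARCODE_LABELS: dict[str, str] = {
    "11": "Jurkat_DMSO_control1",
    "12": "Jurkat_betuline1",
    "13": "Healthy_control1",
    "14": "Jurkat_betuline2",
    "15": "Jurkat_DMSO_control2",
}


def label_for_file(filename: str) -> str:
    """Return the group label for the first run of digits in the file name."""
    match = re.search(r"\d+", filename)
    if match is None:
        raise ValueError(f"no barcode number found in {filename!r}")
    barcode = match.group()
    if barcode not in BARCODE_LABELS:
        raise KeyError(f"barcode {barcode} is not in the barcode table")
    return BARCODE_LABELS[barcode]


# Hypothetical file name, used only for illustration.
label = label_for_file("barcode14_methylatie_ALL.csv")
print(label)
```

Because only the first run of digits is used, a file name with another number in front of the barcode would be labelled incorrectly; raising an error for unknown barcodes makes such cases visible.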

In [4]:
print(df.head())
shape: (5, 6)
┌──────┬───────┬───────┬──────┬───────┬──────────────────┐
│ chr  ┆ start ┆ end   ┆ frac ┆ valid ┆ group_name       │
│ ---  ┆ ---   ┆ ---   ┆ ---  ┆ ---   ┆ ---              │
│ str  ┆ i64   ┆ i64   ┆ f64  ┆ i64   ┆ str              │
╞══════╪═══════╪═══════╪══════╪═══════╪══════════════════╡
│ chr1 ┆ 10468 ┆ 10469 ┆ 1.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr1 ┆ 10470 ┆ 10471 ┆ 1.0  ┆ 2     ┆ Jurkat_betuline2 │
│ chr1 ┆ 10488 ┆ 10489 ┆ 1.0  ┆ 2     ┆ Jurkat_betuline2 │
│ chr1 ┆ 10492 ┆ 10493 ┆ 1.0  ┆ 2     ┆ Jurkat_betuline2 │
│ chr1 ┆ 10496 ┆ 10497 ┆ 1.0  ┆ 1     ┆ Jurkat_betuline2 │
└──────┴───────┴───────┴──────┴───────┴──────────────────┘

Data exploration

22-11-2024

I now have a dataframe with the methylation data, with a column called group_name that holds the name of the group the data comes from.

In [5]:
# Keep the three selected groups
test: pl.DataFrame = df.filter(
    pl.col("group_name").is_in(['Healthy_control1', 'Jurkat_betuline1', 'Jurkat_betuline2'])
)
# Then keep the positions inside the CDK1 range on chr10
test = test.filter((pl.col("start") >= 60778131) &
                   (pl.col("end") <= 60778731) &
                   (pl.col("chr") == "chr10"))
print(test)
shape: (83, 6)
┌───────┬──────────┬──────────┬──────┬───────┬──────────────────┐
│ chr   ┆ start    ┆ end      ┆ frac ┆ valid ┆ group_name       │
│ ---   ┆ ---      ┆ ---      ┆ ---  ┆ ---   ┆ ---              │
│ str   ┆ i64      ┆ i64      ┆ f64  ┆ i64   ┆ str              │
╞═══════╪══════════╪══════════╪══════╪═══════╪══════════════════╡
│ chr10 ┆ 60778212 ┆ 60778213 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778217 ┆ 60778218 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778237 ┆ 60778238 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778258 ┆ 60778259 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778283 ┆ 60778284 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ …     ┆ …        ┆ …        ┆ …    ┆ …     ┆ …                │
│ chr10 ┆ 60778682 ┆ 60778683 ┆ 0.0  ┆ 3     ┆ Healthy_control1 │
│ chr10 ┆ 60778689 ┆ 60778690 ┆ 0.0  ┆ 3     ┆ Healthy_control1 │
│ chr10 ┆ 60778700 ┆ 60778701 ┆ 0.0  ┆ 3     ┆ Healthy_control1 │
│ chr10 ┆ 60778707 ┆ 60778708 ┆ 0.0  ┆ 3     ┆ Healthy_control1 │
│ chr10 ┆ 60778723 ┆ 60778724 ┆ 0.0  ┆ 3     ┆ Healthy_control1 │
└───────┴──────────┴──────────┴──────┴───────┴──────────────────┘

This df contains all methylation data in the range 60778131 to 60778731 on chr10 (CDK1), for the three selected groups. This is a possible way to select all methylation data in a range.

In [6]:
all_groups = pl.DataFrame({"group_name": df["group_name"].unique()})

test_agg: pl.DataFrame = (
    test
    .select(["group_name", "frac"])
    .group_by("group_name")
    .agg([pl.len().alias("n methylations")])
    .join(all_groups, on="group_name", how="full")
    .with_columns(pl.col("group_name").fill_null(pl.col("group_name_right")))
    .drop("group_name_right") 
    .fill_null(0)
)
print(test_agg)
shape: (5, 2)
┌──────────────────────┬────────────────┐
│ group_name           ┆ n methylations │
│ ---                  ┆ ---            │
│ str                  ┆ u32            │
╞══════════════════════╪════════════════╡
│ Jurkat_betuline1     ┆ 0              │
│ Jurkat_betuline2     ┆ 39             │
│ Jurkat_DMSO_control1 ┆ 0              │
│ Jurkat_DMSO_control2 ┆ 0              │
│ Healthy_control1     ┆ 44             │
└──────────────────────┴────────────────┘

This df contains the number of methylations in the range specified in the test df. Only the healthy control group and one of the two betuline groups appear to have methylations here. These results agree with what the students found for this gene (CDK1).
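
Groups without any rows in the range still show up with a count of 0, because the counts are joined against the list of all groups and the gaps are filled. A plain-Python sketch of that idea follows; the helper name count_per_group is an assumption for illustration, and the input simply reproduces the 39 and 44 rows counted above.

```python
from collections import Counter

ALL_GROUPS: list[str] = [
    "Jurkat_betuline1",
    "Jurkat_betuline2",
    "Jurkat_DMSO_control1",
    "Jurkat_DMSO_control2",
    "Healthy_control1",
]


def count_per_group(group_names: list[str], all_groups: list[str]) -> dict[str, int]:
    """Count rows per group and report 0 for groups that have no rows at all."""
    observed = Counter(group_names)
    return {group: observed.get(group, 0) for group in all_groups}


# One entry per row found in the range, as in the table above.
rows_in_range = ["Jurkat_betuline2"] * 39 + ["Healthy_control1"] * 44
counts = count_per_group(rows_in_range, ALL_GROUPS)
print(counts)
```

Keeping the zero rows matters for the plots: a group that is missing from the data would otherwise silently disappear from the axis.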

In [7]:
sns.set_theme()

sns.barplot(data = test_agg.sort("n methylations", descending=True),
            y = "group_name", x = "n methylations",
            hue="group_name", palette="Set2")
plt.show()

This plot shows the number of methylations for one gene (CDK1). The x-axis holds the number of methylations and the y-axis holds the group it belongs to.

Only two groups appear to have methylated DNA in this range, which agrees with the research and data processing that the research students have done.

25-11-2024

Next I would like to check whether the difference between end and start is always 1, to detect possibly faulty data.

In [8]:
start_end_diff: pl.DataFrame = (
    df
    .filter(pl.col("end") - pl.col("start") != 1)
)
print(start_end_diff)
shape: (0, 6)
┌─────┬───────┬─────┬──────┬───────┬────────────┐
│ chr ┆ start ┆ end ┆ frac ┆ valid ┆ group_name │
│ --- ┆ ---   ┆ --- ┆ ---  ┆ ---   ┆ ---        │
│ str ┆ i64   ┆ i64 ┆ f64  ┆ i64   ┆ str        │
╞═════╪═══════╪═════╪══════╪═══════╪════════════╡
└─────┴───────┴─────┴──────┴───────┴────────────┘

This output means that the difference between end and start is always equal to 1, so there are no faulty positions.

I would like to see if there are any major differences between the groups in the amount of methylated DNA.

In [9]:
df_n_methylation: pl.DataFrame = (
    df
    .select("group_name")
    .group_by("group_name")
    .agg([pl.len().alias("n methylations")])
)
print(df_n_methylation.head())
shape: (5, 2)
┌──────────────────────┬────────────────┐
│ group_name           ┆ n methylations │
│ ---                  ┆ ---            │
│ str                  ┆ u32            │
╞══════════════════════╪════════════════╡
│ Jurkat_betuline1     ┆ 1994070        │
│ Healthy_control1     ┆ 14820163       │
│ Jurkat_DMSO_control1 ┆ 2541070        │
│ Jurkat_DMSO_control2 ┆ 2272843        │
│ Jurkat_betuline2     ┆ 2654900        │
└──────────────────────┴────────────────┘

This table holds the total number of methylations per group. Visualising this table would make it easier to see any differences.

In [10]:
sns.barplot(data = df_n_methylation,
            y = "group_name", x = "n methylations",
            hue="group_name", palette="Set2",).set(
                title="Amount of methylations for all groups",
                xlabel="Amount of methylations", ylabel = "Group name")
plt.show()

This plot visualises the number of methylations for every group in the experiment. The x-axis holds the number of methylations, while the y-axis holds the name of the group that the number belongs to.

The plot clearly shows that the healthy control group has far more methylations than the other groups. This suggests that something about the other groups affects the methylation. It is unclear whether this is the betuline, because the DMSO control groups also have far fewer methylations.

I could zoom in on the other groups to visualise the differences between the treated groups.

In [11]:
sns.barplot(data = df_n_methylation.filter(pl.col("group_name") != "Healthy_control1"),
            y = "group_name", x = "n methylations",
            hue="group_name", palette="Set2",).set(
                title="Amount of methylations for all treated groups",
                xlabel="Amount of methylations", ylabel = "Group name")
plt.show()

This plot visualises the number of methylations for every treated group, so without the healthy control. The x-axis holds the number of methylations, while the y-axis holds the name of the group that the number belongs to.

There does not appear to be a pattern between or within the groups.

BED file annotation

26-11-2024

To make data selection easier I need to do the following:

  • Add gene symbols to the given BED file, so that specific genes are easier to look up

I will clean the BED file first, because it is very inconsistent with tabs and spaces.
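
The effect of the cleaning step can be sketched on a few in-memory lines. The coordinates below are taken from the BED file shown in this log, but the mix of spaces and tabs is invented for illustration, and clean_bed_line is a hypothetical helper name.

```python
import re


def clean_bed_line(line: str) -> str:
    """Strip the line and collapse every run of whitespace into a single tab."""
    return re.sub(r"\s+", "\t", line.strip())


# Invented whitespace around real coordinates, to mimic the inconsistent file.
raw_lines = [
    "chr1  24735\t33737\n",
    "chr10\t60778131   60778731  \n",
    " chr17 \t 7667421\t7668621\n",
]
cleaned = [clean_bed_line(line) for line in raw_lines]
for line in cleaned:
    print(repr(line))
```

After this step every line has exactly three tab-separated fields, which is what a tab-separated reader expects.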

In [12]:
path_bed = "/commons/Themas/Thema06/Methylatie/RRMS_human_hg38.bed"
file_cleaned = []
with open(path_bed, 'r') as bed_file:
    for line in bed_file:
        replaced_line = re.sub(r"\s+", "\t", line.strip())

        file_cleaned.append(replaced_line)
with open("../data/new_bed_file.bed", "w") as new_bed:
    new_bed.write("chr\tstart\tend\n")
    new_bed.write("\n".join(file_cleaned))
In [13]:
bed_df = pl.from_pandas(pd.read_csv("../data/new_bed_file.bed", sep="\t"))
print(bed_df)
shape: (18_069, 3)
┌───────┬──────────┬──────────┐
│ chr   ┆ start    ┆ end      │
│ ---   ┆ ---      ┆ ---      │
│ str   ┆ i64      ┆ i64      │
╞═══════╪══════════╪══════════╡
│ chr1  ┆ 24735    ┆ 33737    │
│ chr1  ┆ 131124   ┆ 139563   │
│ chr1  ┆ 195251   ┆ 204121   │
│ chr1  ┆ 364792   ┆ 386185   │
│ chr1  ┆ 487107   ┆ 495546   │
│ …     ┆ …        ┆ …        │
│ chr18 ┆ 59898996 ┆ 59900196 │
│ chr19 ┆ 47220224 ┆ 47221024 │
│ chr11 ┆ 798884   ┆ 799484   │
│ chr10 ┆ 60778131 ┆ 60778731 │
│ chr17 ┆ 7667421  ┆ 7668621  │
└───────┴──────────┴──────────┘

The mart_export.txt file should hold the genomic locations of genes. Let's load it into the session so that we can possibly filter the promoter regions with it.

In [14]:
biomart_bed = pl.from_pandas(pd.read_csv("/homes/rreilman/Downloads/mart_export.txt", low_memory=False))
biomart_bed = biomart_bed.select(["Chromosome/scaffold name", "Gene start (bp)", "Gene end (bp)", "Gene name"])
biomart_bed = biomart_bed.rename({
    "Chromosome/scaffold name": "chr",
    "Gene start (bp)": "start",
    "Gene end (bp)": "end",
    "Gene name": "gene_name"
})
biomart_bed = biomart_bed.with_columns(
    pl.col("gene_name").fill_null("unknown gene")
)
print(biomart_bed.head())
shape: (5, 4)
┌─────┬───────┬──────┬───────────┐
│ chr ┆ start ┆ end  ┆ gene_name │
│ --- ┆ ---   ┆ ---  ┆ ---       │
│ str ┆ i64   ┆ i64  ┆ str       │
╞═════╪═══════╪══════╪═══════════╡
│ MT  ┆ 577   ┆ 647  ┆ MT-TF     │
│ MT  ┆ 648   ┆ 1601 ┆ MT-RNR1   │
│ MT  ┆ 1602  ┆ 1670 ┆ MT-TV     │
│ MT  ┆ 1671  ┆ 3229 ┆ MT-RNR2   │
│ MT  ┆ 3230  ┆ 3304 ┆ MT-TL1    │
└─────┴───────┴──────┴───────────┘
In [15]:
def annotate_bed(bed_df: pl.DataFrame, annotate_df: pl.DataFrame):
    new_df = []
    for promoter in bed_df.iter_rows():
        chr_promoter, start_promoter, end_promoter = promoter

        overlaps = (annotate_df
                    .filter(
                        (pl.col("chr") == chr_promoter.replace("chr", "")) &
                        (pl.col("start") <= end_promoter) &
                        (pl.col("end") >= start_promoter)
                        )
                    )
        if not overlaps.is_empty():
            for gene in overlaps["gene_name"].to_list():
                new_df.append({"chr":chr_promoter,
                           "start":start_promoter,
                           "end":end_promoter,
                          "gene_name":gene})

        else:
            new_df.append({"chr":chr_promoter,
                           "start":start_promoter,
                           "end":end_promoter,
                          "gene_name":"Unknown gene"})
            
    
    
    return pl.DataFrame(new_df)

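This function takes the promoter regions from a BED dataframe and returns them in a new dataframe, with one row for every gene that overlaps a region. Regions without an overlapping gene are labelled "Unknown gene".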
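
The overlap test at the heart of the annotation can be sketched in plain Python. The gene names and coordinates below are invented for illustration, and overlaps and genes_for_region are hypothetical helper names; the comparison itself is the same one the function uses.

```python
def overlaps(start_a: int, end_a: int, start_b: int, end_b: int) -> bool:
    """Two ranges overlap when each one starts before the other one ends."""
    return start_a <= end_b and end_a >= start_b


def genes_for_region(region: tuple[str, int, int],
                     genes: list[tuple[str, int, int, str]]) -> list[str]:
    """Return the names of all genes that overlap the region on the same chromosome."""
    chrom, start, end = region
    names = [
        name
        for gene_chrom, gene_start, gene_end, name in genes
        if gene_chrom == chrom and overlaps(start, end, gene_start, gene_end)
    ]
    return names or ["Unknown gene"]


# Invented genes, for illustration only.
genes = [
    ("chr1", 100, 200, "GENE_A"),
    ("chr1", 150, 400, "GENE_B"),
    ("chr2", 100, 200, "GENE_C"),
]
print(genes_for_region(("chr1", 180, 220), genes))
print(genes_for_region(("chr1", 500, 600), genes))
```

Because the comparison uses <= and >=, two ranges that only touch at an end point also count as overlapping. BED coordinates are half-open, so a stricter < and > test could be considered if that matters for the annotation.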

It appeared that genes the students need, like CDK1, are not in the mart file, so we cannot use that file. Martijn told me about annotation from a GBFF file; I'm going to look into that.

28-11-2024

I'm going to look at annotating my BED file via a GBFF file from NCBI.

In [17]:
gbff_file = "/homes/rreilman/jaar2/ncbi_dataset/data/GCF_000001405.26/genomic.gbff"
gene_information = []
# Open genomic file
for record in SeqIO.parse(gbff_file, "genbank"):
    chr_name = record.id
    for feature in record.features:
        # Get the ranges
        if feature.type == "gene":
            start = int(feature.location.start)
            end = int(feature.location.end)
            gene_name = feature.qualifiers.get("gene", ["Unknown"])[0]
            gene_information.append({"chr": chr_name,
                                     "start":max(0, start-1000),
                                     "end":end,
                                     "gene_name":gene_name})
gbff_gene_df = pl.DataFrame(gene_information)
print(gbff_gene_df.filter(pl.col("gene_name") == "CDK1"))
shape: (1, 4)
┌──────────────┬──────────┬──────────┬───────────┐
│ chr          ┆ start    ┆ end      ┆ gene_name │
│ ---          ┆ ---      ┆ ---      ┆ ---       │
│ str          ┆ i64      ┆ i64      ┆ str       │
╞══════════════╪══════════╪══════════╪═══════════╡
│ NC_000010.11 ┆ 60771975 ┆ 60794852 ┆ CDK1      │
└──────────────┴──────────┴──────────┴───────────┘

This df contains the chromosome, start, end and name of every gene in the GBFF file, with the start extended by 1000 bp (clipped at 0). The chromosome naming convention does not match that of our BED file (chr*), so I will have to change that.

In [18]:
chrome_mapping = pl.from_pandas(pd.read_csv("/homes/rreilman/jaar2/chromosome_mapping.csv", delimiter="\t"))
print(chrome_mapping)
shape: (455, 2)
┌──────────────────────┬─────────────────┐
│ RefSeq seq accession ┆ Chromosome name │
│ ---                  ┆ ---             │
│ str                  ┆ str             │
╞══════════════════════╪═════════════════╡
│ NC_000001.11         ┆ 1               │
│ NC_000002.12         ┆ 2               │
│ NC_000003.12         ┆ 3               │
│ NC_000004.12         ┆ 4               │
│ NC_000005.10         ┆ 5               │
│ …                    ┆ …               │
│ NT_187685.1          ┆ 19              │
│ NT_187686.1          ┆ 19              │
│ NT_187687.1          ┆ 19              │
│ NT_113949.2          ┆ 19              │
│ NC_012920.1          ┆ MT              │
└──────────────────────┴─────────────────┘

This df maps the NCBI (RefSeq) naming convention to the chromosome names I use.

In [19]:
gbff_gene_df_updated = gbff_gene_df.join(chrome_mapping, left_on="chr", right_on="RefSeq seq accession", how="left")
gbff_gene_df_updated = gbff_gene_df_updated.with_columns(
    pl.when(pl.col("Chromosome name").is_not_null())
    .then(pl.col("Chromosome name"))
    .otherwise(pl.col("chr"))
    .alias("chr")
).select(["chr", "start", "end", "gene_name"])
print(gbff_gene_df_updated.filter(pl.col("gene_name") == "CDK1"))
shape: (1, 4)
┌─────┬──────────┬──────────┬───────────┐
│ chr ┆ start    ┆ end      ┆ gene_name │
│ --- ┆ ---      ┆ ---      ┆ ---       │
│ str ┆ i64      ┆ i64      ┆ str       │
╞═════╪══════════╪══════════╪═══════════╡
│ 10  ┆ 60771975 ┆ 60794852 ┆ CDK1      │
└─────┴──────────┴──────────┴───────────┘

This dataframe now uses our naming convention instead of the NCBI chromosome naming convention. The next step is to annotate our BED dataframe.
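
The renaming step boils down to a lookup with a fallback. The plain-Python sketch below uses two accessions that appear in this log; the unknown accession and the helper name to_bed_chromosome are made up. Unlike the notebook, which keeps the bare chromosome name and strips "chr" from the BED side when comparing, this variant adds the "chr" prefix directly.

```python
# Two accessions taken from the mapping table and the CDK1 row shown in this log.
ACCESSION_TO_CHROMOSOME: dict[str, str] = {
    "NC_000001.11": "1",
    "NC_000010.11": "10",
}


def to_bed_chromosome(accession: str, mapping: dict[str, str]) -> str:
    """Translate a RefSeq accession to a chr* name, keeping unknown accessions as-is."""
    name = mapping.get(accession)
    if name is None:
        return accession  # same fallback as the when/otherwise step in the notebook
    return f"chr{name}"


print(to_bed_chromosome("NC_000010.11", ACCESSION_TO_CHROMOSOME))
# "NT_000000.0" is a made-up accession that is not in the mapping.
print(to_bed_chromosome("NT_000000.0", ACCESSION_TO_CHROMOSOME))
```

Whichever side is converted, the important part is that both dataframes use the same convention before they are compared.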

In [20]:
annotated_bed_file = annotate_bed(bed_df, gbff_gene_df_updated)
print(annotated_bed_file)
#annotated_bed_file.write_csv("../data/annotated_bed.bed")
shape: (30_681, 4)
┌───────┬──────────┬──────────┬──────────────┐
│ chr   ┆ start    ┆ end      ┆ gene_name    │
│ ---   ┆ ---      ┆ ---      ┆ ---          │
│ str   ┆ i64      ┆ i64      ┆ str          │
╞═══════╪══════════╪══════════╪══════════════╡
│ chr1  ┆ 24735    ┆ 33737    ┆ WASH7P       │
│ chr1  ┆ 24735    ┆ 33737    ┆ MIR1302-2    │
│ chr1  ┆ 24735    ┆ 33737    ┆ FAM138A      │
│ chr1  ┆ 24735    ┆ 33737    ┆ LOC102724250 │
│ chr1  ┆ 24735    ┆ 33737    ┆ TRNAN-GUU    │
│ …     ┆ …        ┆ …        ┆ …            │
│ chr19 ┆ 47220224 ┆ 47221024 ┆ BBC3         │
│ chr11 ┆ 798884   ┆ 799484   ┆ PANO         │
│ chr11 ┆ 798884   ┆ 799484   ┆ PIDD         │
│ chr10 ┆ 60778131 ┆ 60778731 ┆ CDK1         │
│ chr17 ┆ 7667421  ┆ 7668621  ┆ TP53         │
└───────┴──────────┴──────────┴──────────────┘

I will now create the same bar plot as before, to check if the result is the same.

In [21]:
def get_gene_info(genes_list: list[str], annotated_bed_df):
    df_wanted = (annotated_bed_df
                 .filter(pl.col("gene_name").is_in(genes_list)))
    return df_wanted
df_cdk1 = get_gene_info(["CDK1"], annotated_bed_file)
print(df_cdk1)
shape: (2, 4)
┌───────┬──────────┬──────────┬───────────┐
│ chr   ┆ start    ┆ end      ┆ gene_name │
│ ---   ┆ ---      ┆ ---      ┆ ---       │
│ str   ┆ i64      ┆ i64      ┆ str       │
╞═══════╪══════════╪══════════╪═══════════╡
│ chr10 ┆ 60774212 ┆ 60783104 ┆ CDK1      │
│ chr10 ┆ 60778131 ┆ 60778731 ┆ CDK1      │
└───────┴──────────┴──────────┴───────────┘

This dataframe contains the promoter regions for the CDK1 gene. I will use these to filter the df with the methylation data and create a bar plot.

In [22]:
methylation_cdk1 = (df
                    .filter((pl.col("chr").is_in(df_cdk1.select("chr"))) &
                            (pl.col("start") >= df_cdk1.select("start").to_numpy()[0]) &
                            (pl.col("end") <= df_cdk1.select("end").to_numpy()[0])))
print(methylation_cdk1)



test_agg2: pl.DataFrame = (
    methylation_cdk1
    .select(["group_name", "frac"])
    .group_by("group_name")
    .agg([pl.len().alias("n methylations")])
    .join(all_groups, on="group_name", how="full")
    .with_columns(pl.col("group_name").fill_null(pl.col("group_name_right")))
    .drop("group_name_right") 
    .fill_null(0)
)
print(test_agg2.head())

sns.barplot(data = test_agg2.sort("n methylations", descending=True),
            y = "group_name", x = "n methylations",
            hue="group_name", palette="Set2")
plt.show()
shape: (248, 6)
┌───────┬──────────┬──────────┬──────┬───────┬──────────────────┐
│ chr   ┆ start    ┆ end      ┆ frac ┆ valid ┆ group_name       │
│ ---   ┆ ---      ┆ ---      ┆ ---  ┆ ---   ┆ ---              │
│ str   ┆ i64      ┆ i64      ┆ f64  ┆ i64   ┆ str              │
╞═══════╪══════════╪══════════╪══════╪═══════╪══════════════════╡
│ chr10 ┆ 60774589 ┆ 60774590 ┆ 1.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60774731 ┆ 60774732 ┆ 1.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60775086 ┆ 60775087 ┆ 1.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60775117 ┆ 60775118 ┆ 1.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60775225 ┆ 60775226 ┆ 1.0  ┆ 1     ┆ Jurkat_betuline2 │
│ …     ┆ …        ┆ …        ┆ …    ┆ …     ┆ …                │
│ chr10 ┆ 60780807 ┆ 60780808 ┆ 1.0  ┆ 2     ┆ Healthy_control1 │
│ chr10 ┆ 60780818 ┆ 60780819 ┆ 1.0  ┆ 2     ┆ Healthy_control1 │
│ chr10 ┆ 60781020 ┆ 60781021 ┆ 1.0  ┆ 3     ┆ Healthy_control1 │
│ chr10 ┆ 60781482 ┆ 60781483 ┆ 1.0  ┆ 4     ┆ Healthy_control1 │
│ chr10 ┆ 60781532 ┆ 60781533 ┆ 1.0  ┆ 6     ┆ Healthy_control1 │
└───────┴──────────┴──────────┴──────┴───────┴──────────────────┘
shape: (5, 2)
┌──────────────────────┬────────────────┐
│ group_name           ┆ n methylations │
│ ---                  ┆ ---            │
│ str                  ┆ u32            │
╞══════════════════════╪════════════════╡
│ Jurkat_betuline1     ┆ 0              │
│ Jurkat_betuline2     ┆ 123            │
│ Jurkat_DMSO_control1 ┆ 0              │
│ Jurkat_DMSO_control2 ┆ 0              │
│ Healthy_control1     ┆ 125            │
└──────────────────────┴────────────────┘

This gives a result similar to the first bar plot. The only difference is that the annotation connected an extra promoter region to the CDK1 gene. This is usable for now, but I'll have to ask Martijn if it is correct. The next step is to clean up my code a bit.

The way the main df is being filtered is currently incorrect: only the first promoter region is used. I will have to create a function for this to make it more usable.

03-12-2024

I noticed that I copied code to select data, so I'm going to create a function for this. The function takes a promoter_bed_df with annotated gene names, filtered on the genes of interest, and uses that dataframe to find all methylation data in those promoter regions.

In [23]:
def filter_data_from_bed(main_df, promoter_df):
    final_subsetted_df: pl.DataFrame = pl.DataFrame(
        {"chr":[],
         "start":[],
         "end":[],
         "frac":[],
         "valid":[],
         "group_name":[]}
    )
    for row in promoter_df.iter_rows():
        (chromosome, promoter_start, promoter_end, gene_name) = row
        subsetted_df = main_df.filter(
            (pl.col("chr") == chromosome) &
            (pl.col("start") >= promoter_start) &
            (pl.col("end") <= promoter_end)
        )
        final_subsetted_df = pl.concat([subsetted_df, final_subsetted_df])
        
    return final_subsetted_df
cdk1_test_df = filter_data_from_bed(df, df_cdk1)
print(cdk1_test_df.head())
shape: (5, 6)
┌───────┬──────────┬──────────┬──────┬───────┬──────────────────┐
│ chr   ┆ start    ┆ end      ┆ frac ┆ valid ┆ group_name       │
│ ---   ┆ ---      ┆ ---      ┆ ---  ┆ ---   ┆ ---              │
│ str   ┆ i64      ┆ i64      ┆ f64  ┆ i64   ┆ str              │
╞═══════╪══════════╪══════════╪══════╪═══════╪══════════════════╡
│ chr10 ┆ 60778212 ┆ 60778213 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778217 ┆ 60778218 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778237 ┆ 60778238 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778258 ┆ 60778259 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778283 ┆ 60778284 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
└───────┴──────────┴──────────┴──────┴───────┴──────────────────┘

This dataframe contains all methylation positions that fall into the promoter region(s) of the CDK1 gene. This is a good way to get methylation information for specific genes.
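
The two CDK1 regions found above overlap: 60778131 to 60778731 lies inside 60774212 to 60783104. When the selection is done once per region and the results are stacked, a position inside both regions is selected twice. The plain-Python sketch below shows one possible alternative that keeps every position at most once. The helper name positions_in_regions is an assumption, and the third position is invented to show a row that falls outside both regions.

```python
Region = tuple[str, int, int]
Position = tuple[str, int, int, str]


def positions_in_regions(positions: list[Position], regions: list[Region]) -> list[Position]:
    """Keep every position that lies inside at least one region, at most once."""
    selected: list[Position] = []
    for chrom, start, end, group in positions:
        inside_any = any(
            chrom == region_chrom and start >= region_start and end <= region_end
            for region_chrom, region_start, region_end in regions
        )
        if inside_any:
            selected.append((chrom, start, end, group))
    return selected


cdk1_regions: list[Region] = [
    ("chr10", 60774212, 60783104),
    ("chr10", 60778131, 60778731),
]
positions: list[Position] = [
    ("chr10", 60778212, 60778213, "Jurkat_betuline2"),  # inside both regions
    ("chr10", 60774589, 60774590, "Jurkat_betuline2"),  # inside the larger region only
    ("chr10", 60790000, 60790001, "Healthy_control1"),  # invented row, outside both
]
selected = positions_in_regions(positions, cdk1_regions)
print(len(selected))
```

Looping over the positions instead of over the regions is what prevents the duplicates here; in polars a similar effect could be reached by dropping duplicate rows after the concatenation.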

04-12-2024

To see if this works, I will create a function that counts the methylation data per group.

In [24]:
def count_methylation_data(main_df):
    return (main_df
    .select(["group_name", "frac"])
    .group_by("group_name")
    .agg([pl.len().alias("n methylations")])
    .join(all_groups, on="group_name", how="full")
    .with_columns(pl.col("group_name").fill_null(pl.col("group_name_right")))
    .drop("group_name_right") 
    .fill_null(0)
)
    
cdk1_count_data = count_methylation_data(cdk1_test_df)
cdk1_count_data
Out[24]:
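shape: (5, 2)
┌──────────────────────┬────────────────┐
│ group_name           ┆ n methylations │
│ ---                  ┆ ---            │
│ str                  ┆ u32            │
╞══════════════════════╪════════════════╡
│ Jurkat_betuline1     ┆ 0              │
│ Jurkat_betuline2     ┆ 162            │
│ Jurkat_DMSO_control1 ┆ 0              │
│ Jurkat_DMSO_control2 ┆ 0              │
│ Healthy_control1     ┆ 169            │
└──────────────────────┴────────────────┘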

This table can be used as data for a bar plot that shows how many methylations every group has for the selected gene(s). I'm going to create a function that plots this data as a bar plot.

In [25]:
def code_count_data(data_df):
    sns.barplot(data = data_df.sort("n methylations", descending=True),
            y = "group_name", x = "n methylations",
            hue="group_name", palette="Set2")
    plt.show()
code_count_data(cdk1_count_data)

This shows that multiple promoter regions were found for CDK1, regions that possibly weren't found by Martijn. Note that the second region (60778131 to 60778731) lies inside the first (60774212 to 60783104), so its positions are counted twice: 123 + 39 = 162 and 125 + 44 = 169. I'm still going to use this method, since it is mostly correct. I also want to try different plotting libraries, to see if there are better ones.

I read about plotly, which generates interactive plots. I'm going to create a bar plot and see if I like the way it looks and feels.

In [26]:
fig = px.bar(cdk1_count_data, y="n methylations", x="group_name", color="group_name",
             title="Amount of methylations for the CDK1 gene")
fig.show()

I like that this bar plot is interactive, which will make it easier to read at higher counts. I don't much like the look of it, but that is a matter of personal preference.

Plotting / visualisation testing and ideas

09-12-2024

I did some more research into plotting today and stumbled upon hvPlot. hvPlot can be used to create highly customisable interactive plots, which I will use for the dashboard. It should also be a bit faster.

In [27]:
def plot_barchart(df: pl.DataFrame):
    barplot = df.hvplot.bar(x = "group_name", y="n methylations",
                  color = "group_name", cmap = "Category10", width = 900)
    return barplot
    
plot_barchart(cdk1_count_data)
Out[27]:

This function generates a nice-looking bar plot that is also interactive. Perfect for the dashboard!

10-12-2024

Now I have to see what other plots I can use to visualise the data. A scatter plot might be usable to plot the individual methylation positions. This will not be a fast plot with all of the data, so the user must filter on range, chr and group to speed it up.
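
The filtering that should happen before the scatter plot is drawn can be sketched in plain Python. The helper name select_for_scatter and its parameters are assumptions for illustration; the rows are copied from tables earlier in this log.

```python
def select_for_scatter(rows: list[dict], chromosome: str, start: int, end: int,
                       groups: set[str]) -> list[dict]:
    """Keep only rows on one chromosome, inside a range, for the chosen groups."""
    return [
        row
        for row in rows
        if row["chr"] == chromosome
        and row["start"] >= start
        and row["end"] <= end
        and row["group_name"] in groups
    ]


# Rows copied from tables shown earlier in this log.
rows = [
    {"chr": "chr10", "start": 60778682, "end": 60778683, "group_name": "Healthy_control1"},
    {"chr": "chr10", "start": 60778212, "end": 60778213, "group_name": "Jurkat_betuline2"},
    {"chr": "chr1", "start": 10468, "end": 10469, "group_name": "Jurkat_betuline2"},
]
subset = select_for_scatter(rows, "chr10", 60778131, 60778731, {"Healthy_control1"})
print(subset)
```

Filtering first means the plot only has to draw the points the user asked for, which is where most of the speed-up would come from.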

In [28]:
def plot_scatter(df):
    df = (df.
          filter(pl.col("chr").is_in("chr"+chrome_mapping["Chromosome name"].unique())))
    df = df.sample(fraction=0.02)
    return df.hvplot.scatter(x = "start", y="chr", by="group_name", width = 900)

plot_scatter(df)
Out[28]:

At first glance this plot looks like a mess of grey, but once filtered on chr, groups and ranges it will be a lot more readable. Thanks to the interactivity of hvPlot the user will be able to zoom into the plot and look at more specific ranges of the genome. The colors differentiate the groups. I'm not sure if it is usable; I will test it in the Panel site.

In [29]:
def plot_violin(df):
    return  df.hvplot.violin(y = "start", by='group_name', c="group_name", width = 900)
plot_violin(df)
Out[29]:

This plot does not really work. It calculates values like the mean and IQR, which are meaningless for genomic locations. This would mean that most plots based on summary statistics will not work.

In [30]:
def plot_kde(df):
    df = df.select(["group_name", "start"])
    return df.hvplot.kde(by="group_name")

plot_kde(df)
Out[30]:

This plot shows the density of methylations over all chromosomes. The df could be filtered on chromosome and range to get a more specific view, and this could work well in a website.

I will now start working on the Panel application, and I will test the functions/plots in the web space. This way I can also test and improve performance.